library(ggplot2)
library(dplyr)
library(rmarkdown)
library(knitr)
library(GEOquery)
library(SummarizedExperiment)
library(DESeq2)
library(here)
library(limma)
library(nlme)
library(MASS)
library(multcomp)
library(pheatmap)
library(variancePartition)
library(lme4)
library("r2glmm")
library(mclust)
library(gprofiler2)
library(enrichplot)
Data for the study on trivalent influenza vaccine are deposited in National Center for Biotechnology Information (NCBI) Gene Expression Omnibus (GEO) with accession GSE48024. It was published in 2013.
A total of 119 male and 128 female between 19 to 41 years of age were recruited in the study. Whole blood samples were drawn before vaccination (day 0) and at three time points after vaccination (day 1, 3 and 14). All male samples are sequenced using Illumina HumanHT-12 V3.0 expression beadchip and all female samples are sequence using Illumina HumanHT-12 V4.0 expression beadchip.
Two separate datasets are loaded, one for female cohort and another one for male cohort.
gset <- getGEO("GSE48024", GSEMatrix =TRUE, AnnotGPL=TRUE)
## Found 2 file(s)
## GSE48024-GPL10558_series_matrix.txt.gz
## GSE48024-GPL6947_series_matrix.txt.gz
gset_female <- gset[[1]]
gset_male <- gset[[2]]
# Check gender
unique(gset_female$`gender:ch1`)
## [1] "female"
# Check gender
unique(gset_male$`gender:ch1`)
## [1] "male"
# Check sequencing platform
unique(gset_female$platform_id) #Illumina HumanHT-12 V4.0 expression beadchip
## [1] "GPL10558"
# Check sequencing platform
unique(gset_male$platform_id) #Illumina HumanHT-12 V3.0 expression beadchip
## [1] "GPL6947"
Based on the data loaded directly from GEO, 116 out of 119 males and 110 out of 128 females are available. A total of 417 samples are from female cohort and 431 samples are from male cohort.
# Check number of subjects available in data
length(unique(gset_female$`subject:ch1`))
## [1] 110
# Check number of subjects available in data
length(unique(gset_male$`subject:ch1`))
## [1] 116
# Check number of samples available in data
ncol(gset_female)
## Samples
## 417
# Check number of samples available in data
ncol(gset_male)
## Samples
## 431
There are 47276 probes used for female cohort, and 48742 probes used for male cohort.
# Check number of probes
nrow(gset_female)
## Features
## 47276
# Check number of probes
nrow(gset_male)
## Features
## 48742
Data has been processed and normalized according to description given in GEO. We randomly select 10 samples from each cohort to check the distribution of normalized data.
# Data processing + normalization
unique(gset_male$data_processing)
## [1] "Raw data underwent background adjustment, variance stabilization transformation, and robust spline normalization using R package lumi (available from Bioconductor) and function lumiExpresso()"
# Data processing + normalization
unique(gset_female$data_processing)
## [1] "Raw data underwent background adjustment, variance stabilization transformation, and robust spline normalization using R package lumi (available from Bioconductor) and function lumiExpresso()"
boxplot(exprs(gset_male)[,sample(1:431, 10, replace=F)],
main = "Ten random samples from male cohort", ylab = "Expression")
boxplot(exprs(gset_female)[,sample(1:417, 10, replace=F)],
main = "Ten random samples from female cohort", ylab = "Expression")
Last, we check how many samples at each sampling time for each cohort.
# For female
table(gset_female$`time:ch1`)
##
## Day0 Day1 Day14 Day3
## 107 107 98 105
# For male
table(gset_male$`time:ch1`)
##
## Day0 Day1 Day14 Day3
## 111 110 109 101
Because female and male cohorts are sequenced using different platforms, all further data processing and analysis will be done separately.
First to setup SummarizedExperiment object.
SE_male <- SummarizedExperiment(assay = list("exprs" = exprs(gset_male)),
colData = as(gset_male@phenoData, "data.frame"),
rowData = as(gset_male@featureData, "data.frame"))
SE_female <- SummarizedExperiment(assay = list("exprs" = exprs(gset_female)),
colData = as(gset_female@phenoData, "data.frame"),
rowData = as(gset_female@featureData, "data.frame"))
# Create few variables with better names
SE_male$time <- factor(SE_male$`time:ch1`, levels = c("Day0", "Day1", "Day3", "Day14"))
SE_male$ID <- SE_male$`subject:ch1`
SE_female$time <- factor(SE_female$`time:ch1`, levels = c("Day0", "Day1", "Day3", "Day14"))
SE_female$ID <- SE_female$`subject:ch1`
As we have seen in Data Collection section, not all participants are sampled at all four time points. Thus, we keep those who sampled at all time.
male.all <- as.data.frame(table(SE_male$ID)) %>% filter(Freq == 4) %>% rename( "Var1" = "ID")
female.all <- as.data.frame(table(SE_female$ID)) %>% filter(Freq == 4) %>% rename( "Var1" = "ID")
SE_male <- SE_male[, which(SE_male$ID %in% male.all$ID)]
SE_female <- SE_female[, which(SE_female$ID %in% female.all$ID)]
There are 92 males and 87 females sampled at all four time points. These will be used for downstream analysis.
table(SE_male$time)
##
## Day0 Day1 Day3 Day14
## 92 92 92 92
table(SE_female$time)
##
## Day0 Day1 Day3 Day14
## 87 87 87 87
Probes for male cohort: 48742 -> 29228
Probes for female cohort: 47276 -> 31243
gset_male <- gset_male[which(gset_male@featureData@data[["Gene symbol"]] != ""), ]
nrow(gset_male)
## Features
## 29228
gset_female <- gset_female[which(gset_female@featureData@data[["Gene symbol"]] != ""), ]
nrow(gset_female)
## Features
## 31243
Using limma() package.
Probes for male cohort: 29228 -> 19590
Probes for female cohort: 31243 -> 20751
dupRM_male <- avereps(assays(SE_male)$exprs, ID=rowData(SE_male)$"Gene symbol")
nrow(dupRM_male)
## [1] 19590
dupRM_female <- avereps(assays(SE_female)$exprs, ID=rowData(SE_female)$"Gene symbol")
nrow(dupRM_female)
## [1] 20751
Update SummarizeExperiment object with updated expression matrix.
# Make sure samples are ordered the same
all.equal(colnames(dupRM_male) , rownames(SE_male@colData))
## [1] TRUE
SE_male_dupRM <- SummarizedExperiment(assay = list("exprs" = dupRM_male),
colData = SE_male@colData)
all.equal(colnames(dupRM_female) , rownames(SE_female@colData))
## [1] TRUE
SE_female_dupRM <- SummarizedExperiment(assay = list("exprs" = dupRM_female),
colData = SE_female@colData)
Probes for male cohort: 19590 -> 14692
Probes for female cohort: 20751 -> 15563
rowData(SE_male_dupRM)$gene_variance <- rowVars(assays(SE_male_dupRM)$exprs)
quantile(rowData(SE_male_dupRM)$gene_variance)
## 0% 25% 50% 75% 100%
## 5.479204e-05 2.238396e-03 3.938472e-03 1.344815e-02 2.601662e+00
SE_male_dupRM_25up <- SE_male_dupRM[which(rowData(SE_male_dupRM)$gene_variance > 0.0022386139), ]
rowData(SE_female_dupRM)$gene_variance <- rowVars(assays(SE_female_dupRM)$exprs)
quantile(rowData(SE_female_dupRM)$gene_variance)
## 0% 25% 50% 75% 100%
## 0.0002699308 0.0070110061 0.0134392322 0.0334073424 4.5899802389
SE_female_dupRM_25up <- SE_female_dupRM[which(rowData(SE_female_dupRM)$gene_variance > 0.0070110061), ]
For downstream analysis, we use 14692 unique genes and 368 samples from the male cohort, and 15563 genes with 348 samples from the female cohort.
SE_male_dupRM_25up
## class: SummarizedExperiment
## dim: 14692 368
## metadata(0):
## assays(1): exprs
## rownames(14692): EEF1A1 GAPDH ... MCM10 ZNF703
## rowData names(1): gene_variance
## colnames(368): GSM1165057 GSM1165058 ... GSM1165486 GSM1165487
## colData names(43): title geo_accession ... time ID
SE_female_dupRM_25up
## class: SummarizedExperiment
## dim: 15563 348
## metadata(0):
## assays(1): exprs
## rownames(15563): EEF1A1 GAPDH ... MIR320C1 LINC00173
## rowData names(1): gene_variance
## colnames(348): GSM1165512 GSM1165513 ... GSM1165927 GSM1165928
## colData names(43): title geo_accession ... time ID
Data on antibody titer are deposited on dbGaP with restricted access. Thus, we use 11 differentiallyt expressed genes with respect to response given by authors to differentiate high response and low response after vaccination. These 11 genes are found using male cohort.
Gene PRDX2 are not present in both female and male cohort after data pre-processing.
DE11_response <- c("STAT1", "IRF9", "SPI1", "CD74", "HLA-E", "TNFSF13B", "PRDX2", "PRDX3", "E2F2", "PTEN", "ITGB1")
DE11_response[!DE11_response %in% rownames(SE_male_dupRM_25up)]
## [1] "PRDX2"
DE11_response[!DE11_response %in% rownames(SE_female_dupRM_25up)]
## [1] "PRDX2"
We will cluster participants based on difference between day1 and day0 expression.
male_exprs_day1_0 <- list( day1 = SE_male_dupRM_25up[which(rownames(SE_male_dupRM_25up) %in% DE11_response),which(SE_male_dupRM_25up$time == "Day1")],
day0 = SE_male_dupRM_25up[which(rownames(SE_male_dupRM_25up) %in% DE11_response),which(SE_male_dupRM_25up$time == "Day0")])
colnames(male_exprs_day1_0[["day1"]]) <- male_exprs_day1_0[["day1"]]$ID
colnames(male_exprs_day1_0[["day0"]]) <- male_exprs_day1_0[["day0"]]$ID
male_exprs_day1_0[["day1"]] <- as.matrix(assay(male_exprs_day1_0[["day1"]]))
male_exprs_day1_0[["day0"]] <- as.matrix(assay(male_exprs_day1_0[["day0"]]))
dim(male_exprs_day1_0[[1]])
## [1] 10 92
dim(male_exprs_day1_0[[2]])
## [1] 10 92
# Match the order of gene and order of participants
male_exprs_day1_0[[2]] <- male_exprs_day1_0[[2]][match(rownames(male_exprs_day1_0[[1]]), rownames(male_exprs_day1_0[[2]])), match(colnames(male_exprs_day1_0[[1]]), colnames(male_exprs_day1_0[[2]]))]
# Subtract day 0 from day 1 for each participants
male_diff_day1_0 <- (male_exprs_day1_0[["day1"]] - male_exprs_day1_0[["day0"]])
male_diff_day1_0[,1:5]
## IN0123 IN0126 IN0127 IN0128 IN0129
## STAT1 1.31173507 1.52068176 1.02966207 1.22048660 1.33060734
## SPI1 0.12843323 -0.08736426 -0.49973666 0.07800277 0.56261461
## PTEN -0.26807272 0.31520774 -0.45989548 0.23892477 0.35092521
## ITGB1 -0.03593713 0.03315031 -0.28064441 -0.18637363 -0.09974113
## CD74 0.53791679 -0.11917102 0.30187478 0.32976888 0.45044398
## IRF9 0.05764101 0.91409226 0.59053525 0.74820827 0.52649910
## TNFSF13B -0.18962517 0.53251828 0.94899337 0.36595157 0.41775762
## HLA-E -0.07087181 0.35358275 0.50044553 1.16363470 0.05931064
## E2F2 0.34867212 0.58911630 -0.04331496 -0.01942406 0.29831446
## PRDX3 -0.03653738 -0.03568729 -0.42529887 -0.09805881 -0.14152018
# GMM cluster
GMM_male <- Mclust(t(male_diff_day1_0), G = 2)
table(GMM_male$classification)
##
## 1 2
## 27 65
According to authors, genes (“STAT1”, “IRF9”, “SPI1”, “CD74”, “HLA-E”, “TNFSF13B”) are up-regulated in high responders, and remaining genes in the list are up-regulated in low responders.
We assign group 2 as high responders based on boxplots on the difference of expression between day 1 and day 0 by clusters from GMM.
# Check GMM cluster result wrt 10 genes
par(mfrow=c(2,5))
for (i in 1:nrow(male_diff_day1_0)){
dff <- data.frame(expr = male_diff_day1_0[rownames(male_diff_day1_0)[i],], group=GMM_male$classification)
boxplot(expr ~ group,dff, main=rownames(male_diff_day1_0)[i], ylab="diff(day1- day0)")
}
SE_male_dupRM_25up$response <- factor(GMM_male$classification[match(SE_male_dupRM_25up$ID, names(GMM_male$classification))], levels = c(1,2))
Please refer to the Rmd file for code.
GMM result.
##
## 1 2
## 48 39
The 10 genes used for clustering are found using male cohort, and analysis using GMM shows inconsistencies when applied on female cohort. Overall, we assign group 1 as high responders and group 2 as low responders for female cohort.
We first define few functions used for analysis.
mixed_test <- function(gene,gene_name, covariates, fml, var_rowname, j){
df_test <- cbind(gene , covariates)
colnames(df_test)[1] <- "gene"
mix_model <- aov(fml, data=df_test)
df_ret <- data.frame(unlist(summary(mix_model))[paste0("Error: Within.Pr(>F)",j)])
rownames(df_ret) <- var_rowname
colnames(df_ret) <- gene_name
return(tibble(df_ret))
}
GO_analysis <- function(genes){
DE <- gost(query = genes, organism = "hsapiens", multi_query = FALSE,
correction_method = "fdr", user_threshold = 0.01, source = "GO:BP")
print(table(DE[["result"]][["source"]]))
DE_mod = DE$result[,c("query", "term_id",
"term_name", "p_value", "query_size",
"intersection_size", "term_size",
"effective_domain_size", "intersection_size", "source")]
DE_mod$"GeneRatio" = paste0(DE_mod$intersection_size, "/", DE_mod$query_size)
DE_mod$BgRatio = paste0(DE_mod$term_size, "/", DE_mod$effective_domain_size)
DE_mod$"pvalue" <- DE_mod$p_value
DE_mod$Count <- DE_mod$intersection_size
DE_mod$Description<- DE_mod$term_name
DE_mod_enrich <- new("enrichResult", result = DE_mod)
dotplot(DE_mod_enrich , x = 'GeneRatio', color = "pvalue") + ggtitle("GO: Biological process")
}
We first fit mixed effect model, then extract p-value from testing the effect on time variable which is treated as categorical variable.
cov_male <- data.frame(ID = SE_male_dupRM_25up$ID,
time = SE_male_dupRM_25up$time,
response = SE_male_dupRM_25up$response)
genes_male <- rownames(SE_male_dupRM_25up)
mix_test_time_male <- sapply( 1: nrow(SE_male_dupRM_25up),
function(i) mixed_test( gene = as.vector(SE_male_dupRM_25up@assays@data@listData[["exprs"]][i,]),
gene_name = genes_male[i],
covariates = cov_male,
fml = as.formula(gene ~ time + Error(ID)),
var_rowname = c("time"),
j = 1))
mix_test_time_male <- as.data.frame(unlist(mix_test_time_male))
colnames(mix_test_time_male) <- c("pvalue")
hist(mix_test_time_male$pvalue, main ="P-value from fix effect of time", xlab="P-value")
P-values are adjusted using Benjamini-Hotchberg method.
# BH adjusted p-value
mix_test_time_male$BH_pvalue <- p.adjust(mix_test_time_male$pvalue, method = "BH", n = length(mix_test_time_male$pvalue))
hist(mix_test_time_male$BH_pvalue, main ="BH adjusted p-value from fix effect of time", xlab="BH adjusted p-value")
Controlling for 1% of false discovery rate, there are 4540 differentially expressed (DE) genes across time in the male cohort.
DEtime_male <- mix_test_time_male %>% as.data.frame() %>% filter(BH_pvalue < 0.01)
nrow(DEtime_male)
## [1] 4540
Next, we plot heatmap on top 50 DE genes by BH adjusted p-value. For male cohort, the top 50 DE genes are predominantly up-regulated at day 1 after vaccination. This is consistent with what authors reported in their paper.
DE50_male <- SE_male_dupRM_25up[which(rownames(SE_male_dupRM_25up) %in% rownames(DEtime_male %>% arrange(BH_pvalue))[1:50]), ]
colData(DE50_male)$time <- factor(colData(DE50_male)$time, levels = c("Day0", "Day1", "Day3", "Day14"))
DE50_male_mean <- sapply(c("Day0", "Day1", "Day3", "Day14"), function(day) rowMeans(assay(DE50_male)[, which(colData(DE50_male)$time == day)]))
pheatmap(DE50_male_mean, cluster_rows = TRUE, cluster_cols = FALSE, scale = "row",
main = "Mean expression of top 50 DE genes wrt time by BH adjusted p-values")
Next, we perform Gene Ontology enrichment analysis on the 4540 DE genes. Specifically, we are interested in biological process.
There are 1550 GO terms showed significant association with 4540 DE genes found with respect to time. Some of the top GO term related to immune response and stress are what we expected after vaccination.
GO_analysis(rownames(DEtime_male))
##
## GO:BP
## 1550
Variable response is inferred using GMM in previous section.
mix_test_response_time_male <- sapply( 1: nrow(SE_male_dupRM_25up),
function(i) mixed_test( gene = as.vector(SE_male_dupRM_25up@assays@data@listData[["exprs"]][i,]),
gene_name = genes_male[i],
covariates = cov_male,
fml = as.formula(gene ~ time*response + Error(ID)),
var_rowname = c("time"),
j = 2))
mix_test_response_time_male <- as.data.frame(unlist(mix_test_response_time_male))
colnames(mix_test_response_time_male) <- c("pvalue")
hist(mix_test_response_time_male$pvalue, main ="P-value from fix effect of time*response", xlab="P-value")
# BH adjusted p-value
mix_test_response_time_male$BH_pvalue <- p.adjust(mix_test_response_time_male$pvalue, method = "BH", n = length(mix_test_response_time_male$pvalue))
hist(mix_test_response_time_male$BH_pvalue, main ="BH adjusted p-value from fix effect of time*response", xlab="BH adjusted p-value")
Controlling for 5% of false discovery rate, there are 573 differentially expressed (DE) genes across time in the male cohort.
DEresponse_male <- mix_test_response_time_male %>% as.data.frame() %>% filter(BH_pvalue < 0.05)
nrow(DEresponse_male)
## [1] 573
We plot heatmap on top 50 DE genes by BH adjusted p-value. For male cohort, the top 50 DE genes are predominantly up-regulated at day 1 among high responders after vaccination. There are few genes that are down-regulated at day 1among high responders after vaccination.
DE50response_male <- SE_male_dupRM_25up[which(rownames(SE_male_dupRM_25up) %in% rownames(DEresponse_male %>% arrange(BH_pvalue))[1:50]), ]
colData(DE50response_male)$time <- factor(colData(DE50response_male)$time, levels = c("Day0", "Day1", "Day3", "Day14"))
DE_response1_mean_male <- sapply(c( "Day0", "Day1", "Day3", "Day14"), function(day) rowMeans(assay(
DE50response_male)[, which(colData(
DE50response_male)$time == day & DE50response_male$response == 1)]))
colnames(DE_response1_mean_male) <- paste0(colnames(DE_response1_mean_male), "_low")
DE_response2_mean_male <- sapply(c( "Day0", "Day1", "Day3", "Day14"), function(day) rowMeans(assay(
DE50response_male)[, which(colData(
DE50response_male)$time == day & DE50response_male$response == 2)]))
colnames(DE_response2_mean_male) <- paste0(colnames(DE_response2_mean_male), "_high")
Group_male <- data.frame( Group = c(rep("Low responder",4), rep("High responder", 4)))
rownames(Group_male) <- c(colnames(DE_response1_mean_male), colnames(DE_response2_mean_male))
pheatmap(cbind(DE_response1_mean_male,DE_response2_mean_male),
cluster_rows = TRUE, cluster_cols = FALSE, scale = "row",
main = "Mean expression on top 50 DE genes wrt response over time",
annotation_col = Group_male)
There are 483 GO terms showed significant association with 573 DE genes found with respect to response over time. The top GO terms are mostly related to immune response.
GO_analysis(rownames(DEresponse_male))
##
## GO:BP
## 483
Please refer to Rmd file for code.
We fit mixed effect model, and extract p-value on testing the fixed effect of interaction between response and time. P-values are further adusted using Benjamini-Hotchberg method.
Controlling for 1% of false discovery rate, there are 6118 differentially expressed (DE) genes across time in the male cohort.
## [1] 6118
We plot heatmap on top 50 DE genes by BH adjusted p-value. For female cohort, the top 50 DE genes are all up-regulated at day 1 after vaccination.
There are 1682 GO terms showed significant association with 6118 DE genes found with respect to time. Some of the top GO term related to immune response and stress are what we expected after vaccination.
##
## GO:BP
## 1682
Response variable is inferred using GMM in previous section. We fit a mixed effect model including interaction between response and time as fixed effect. Then p-values on testing the effect of the interaction is extracted, then adjusted using Benjamini-Hotchberg method.
Controlling for 5% of false discovery rate, there are 3217 differentially expressed (DE) genes across time in the male cohort.
## [1] 3217
We plot heatmap on top 50 DE genes by BH adjusted p-value. Among the top 50 DE genes, majority are down-regulated at all time points among high responders after vaccination. This is different than the result shown for male analysis.
There are 730 GO terms showed significant association with 3217 DE genes found with respect to response over time. The top GO terms are different as compared to the analysis on male cohort.
##
## GO:BP
## 730